{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Question Answering Using Milvus, Towhee and Hugging Face\n",
    "In this notebook we go over how to search for the best answer to questions using Milvus as the Vector Database, Towhee as the pipeline, and Hugging Face as the model."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Packages\n",
    "We first begin with importing the required packages. In this example, the only non-builtin packages are datasets, towhee and pymilvus. Datasets is the Hugging Face packages to load in the data, towhee is the pipelining application, and pymilvus is the client for Zilliz Cloud. If not present on your system, these packages can be installed using `pip install towhee datasets pymilvus`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/filiphaltmayer/miniconda3/envs/openai/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n"
     ]
    }
   ],
   "source": [
    "# Used to stop annoying error cause by forking\n",
    "import os\n",
    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
    "\n",
    "from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility\n",
    "from towhee.dc2 import pipe, ops, DataCollection\n",
    "from datasets import load_dataset_builder, load_dataset, Dataset"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameters\n",
    "Here we can find the main parameters that need to be modified for running with your own accounts. Beside each is a description of what it is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATASET = 'squad'  # Huggingface Dataset to use\n",
    "MODEL = 'bert-base-uncased'  # Transformer model to use\n",
    "INSERT_RATIO = .001  # What percentage of dataset to embed and insert\n",
    "COLLECTION_NAME = 'huggingface_db'  # Collection name\n",
    "DIMENSION = 768  # Embeddings size\n",
    "LIMIT = 10  # How many results to search for\n",
    "HOST = 'localhost'  # Milvus IP\n",
    "PORT = '19530'  # Milvus Port"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Milvus\n",
    "This segment deals with Milvus and setting up the database for this use case. Within Milvus we need to setup a collection and index the collection. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Connect to Milvus Database\n",
    "connections.connect(host=HOST, port=PORT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Remove collection if it already exists\n",
    "if utility.has_collection(COLLECTION_NAME):\n",
    "    utility.drop_collection(COLLECTION_NAME)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create collection which includes the id, title, and embedding.\n",
    "fields = [\n",
    "    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),\n",
    "    FieldSchema(name='original_question', dtype=DataType.VARCHAR, max_length=1000),\n",
    "    FieldSchema(name='answer', dtype=DataType.VARCHAR, max_length=1000),\n",
    "    FieldSchema(name='original_question_embedding', dtype=DataType.FLOAT_VECTOR, dim=DIMENSION)\n",
    "]\n",
    "schema = CollectionSchema(fields=fields)\n",
    "collection = Collection(name=COLLECTION_NAME, schema=schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create an AutoIndex index for collection.\n",
    "index_params = {\n",
    "    'metric_type':'L2',\n",
    "    'index_type':\"IVF_FLAT\",\n",
    "    'params':{\"nlist\":1536}\n",
    "}\n",
    "collection.create_index(field_name=\"original_question_embedding\", index_params=index_params)\n",
    "collection.load()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Insert Data\n",
    "Once we have the collection setup we need to start inserting our data. This is done in three steps: tokenizing the original question, embedding the tokenized question, and inserting the embedding, original question, and answer. In our example we use Hugging Face Datasets to load in the dataset and then feed that into a Towhee pipeline for embedding and inserting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-02-10 15:02:46,071 - 140704288082112 - builder.py-builder:785 - WARNING: Found cached dataset squad (/Users/filiphaltmayer/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453)\n",
      "100%|██████████| 99/99 [00:00<00:00, 2055.91ex/s]\n"
     ]
    }
   ],
   "source": [
    "data_dataset = load_dataset(DATASET, split='all')\n",
    "data_dataset = data_dataset.train_test_split(test_size=INSERT_RATIO)['test']\n",
    "# Clean up the data structure in the dataset.\n",
    "data_dataset = data_dataset.map(lambda val: {'answer': val['answers']['text'][0]}, remove_columns=['answers'])"
   ]
  },
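  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check (an addition for illustration, not part of the original flow), we can look at one row to confirm that the nested `answers` structure has been flattened into a single `answer` string. The exact row shown depends on the random split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect one example row; it should contain 'question' and the flattened 'answer' field.\n",
    "print(data_dataset[0])"
   ]
  },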
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-02-10 15:02:46.444223: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA\n",
      "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.\n"
     ]
    }
   ],
   "source": [
    "# The inserting pipeline\n",
    "p = (\n",
    "    pipe.input('question', 'answer')\n",
    "        .map('question', 'embedding', ops.text_embedding.transformers(model_name='bert-base-uncased'))\n",
    "        .map('embedding', 'cls_embedding', lambda x: x[0])\n",
    "        .map(\n",
    "            ('question', 'answer', 'cls_embedding'),\n",
    "            (), \n",
    "            ops.ann_insert.milvus_client(\n",
    "                host=HOST, \n",
    "                port=PORT, \n",
    "                collection_name=COLLECTION_NAME,\n",
    "            )\n",
    "        )\n",
    "        .output()\n",
    ")\n",
    "\n",
    "for x in data_dataset:\n",
    "    p(x['question'], x['answer'])"
   ]
  },
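  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional check (added here for illustration), we can flush the collection and confirm how many entities were inserted. `flush()` seals the pending in-memory segments; the exact count depends on the random split above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Seal pending inserts and report the number of stored entities.\n",
    "collection.flush()\n",
    "print(collection.num_entities)"
   ]
  },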
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Question:\n",
      "What was Orsini known for?\n",
      "Answer, Distance, Original Question:\n",
      "['Napoleon III', 13.175783157348633, 'Who did Orsini try to assassinate?']\n",
      "['retracted his decision', 24.993133544921875, 'What did Nasser do after mass demonstrations?']\n",
      "['1865', 25.8225154876709, 'When did Palmerston die?']\n",
      "['US$320,000,000', 26.799659729003906, 'How much money did Nasser spend on weapons?']\n",
      "['bid for statehood', 27.64345932006836, 'Why was this constitutional convention held?']\n",
      "['Oba Kosoko', 27.844146728515625, 'Which Lagos king had supported the slave trade?']\n",
      "['2001', 28.558818817138672, 'When was the Doha Declaration adopted?']\n",
      "['James River and Kanawha Canal', 29.44527816772461, 'What man-made body of water was designed in part by George Washington?']\n",
      "['in a state of self-sacrifice', 29.550865173339844, 'How did the Cathars live?']\n",
      "['1748 with the signing of the Treaty of Aix-la-Chapelle', 29.62041473388672, 'What was the end of the War of the Austrian Succession?']\n",
      "\n",
      "Question:\n",
      "What does the finding of gold cause?\n",
      "Answer, Distance, Original Question:\n",
      "['gold rush', 17.19445037841797, 'What did the finding of gold in Victoria cause?']\n",
      "['Dirección Nacional de Transporte', 20.836410522460938, 'What does DNT stand for?']\n",
      "['Community Based Natural Resource Management', 21.718856811523438, 'What does CBNRM stand for?']\n",
      "['rendering of the Old Testament into Greek', 22.101085662841797, 'What is one of the first known instances of translation in the West?']\n",
      "['James River and Kanawha Canal', 23.819223403930664, 'What man-made body of water was designed in part by George Washington?']\n",
      "['when transporting large goods or moving groups of people between certain floors', 24.325286865234375, 'At what times is Independant service best utilized?']\n",
      "['obligations established by agreement (express or implied) between private parties', 25.668529510498047, 'What is contract law?']\n",
      "['an earthquake', 25.91346549987793, 'What occurrence is measured by a seismometer?']\n",
      "['Neo-Historicism/Revivalism, Traditionalism', 25.926233291625977, 'What is another name for New Classical Architecture?']\n",
      "['high', 26.297353744506836, 'What do dysfunctional people perceive the severity of their pain to be?']\n"
     ]
    }
   ],
   "source": [
    "# The searching pipeline\n",
    "search_p = (\n",
    "    pipe.input('question')\n",
    "    .map('question', 'embedding', ops.text_embedding.transformers(model_name='bert-base-uncased'))\n",
    "    .map('embedding', 'cls_embedding', lambda x: x[0])\n",
    "    .flat_map('cls_embedding', ('id', 'score', 'original_question','answer'), ops.ann_search.milvus_client(host=HOST, port=PORT, collection_name=COLLECTION_NAME, output_fields=['original_question', 'answer'], limit = 10))\n",
    "    .output('answer', 'score', 'original_question')\n",
    ")\n",
    "\n",
    "search_questions = ['What was Orsini known for?', 'What does the finding of gold cause?']\n",
    "for x in search_questions:\n",
    "    res = search_p(x)\n",
    "    print()\n",
    "    print('Question:')\n",
    "    print(x)\n",
    "    print('Answer, Distance, Original Question:')\n",
    "    while(True):\n",
    "        answer = res.get()\n",
    "        if answer:\n",
    "            print(answer)\n",
    "        else:\n",
    "            break"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "openai",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "20e2a1b77e7395ec3f747af99b3084257dd9a83ab453e4f5fc77b9434eecfeb0"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
